(54) Method and apparatus for reaching agreement between nodes in a distributed system

(57) One embodiment of the present invention provides
a system for selecting a node to host a primary
server for a service from a plurality of nodes in a distrib- 
uted computing system. The system operates by receiv- 
ing an indication that a state of the distributed 
computing system has changed. In response to this 
indication, the system determines if there is already a
node hosting the primary server for the service. If not, 
the system selects a node to host the primary server 
using the assumption that a given node from the plural- 
ity of nodes in the distributed computing system hosts 
the primary server. The system then communicates 
rank information between the given node and other 
nodes in the distributed computing system, wherein 
each node in the distributed computing system has a 
unique rank with respect to the other nodes in the dis- 
tributed computing system. The system next compares 
the rank of the given node with the rank of the other 
nodes in the distributed computing system. If one of the 
other nodes has a higher rank than the given node, the 
system disqualifies the given node from hosting the pri- 
mary server. 
Description
Related Application 

[0001] This application hereby claims priority under
35 U.S.C. § 119 to U.S. Provisional Patent Application
No. 60/160,992 filed on October 21, 1999, entitled
"Distributed Multi-Tier Mechanism for Agreement."

BACKGROUND 

Field of the Invention 

[0002] The present invention relates to coordinating 
activities between nodes in a distributed computing sys- 
tem. More specifically, the present invention relates to a 
method and an apparatus for reaching agreement 
between nodes in the distributed computing system 
regarding a node to function as a primary provider for a 
service. 

Related Art 



[0003] As computer networks are increasingly used 
to link computer systems together, distributed comput- 
ing systems have been developed to control interactions 
between computer systems. Some distributed comput- 
ing systems allow client computer systems to access 
resources on server computer systems. For example, a 
client computer system may be able to access information
contained in a database on a server computer system.

[0004] When a server computer system fails, it is 
desirable for the distributed computing system to auto- 
matically recover from this failure. Distributed computer 
systems possessing an ability to recover from such 
server failures are referred to as "highly available sys- 
tems." 

[0005] For a highly available system to function 
properly, the highly available system must be able to 
detect a server failure and reconfigure itself so that
accesses to a failed server are redirected to a backup 
secondary server. 

[0006] One problem in designing such a highly 
available system is that some distributed computing 
system functions must be centralized in order to operate 
efficiently. For example, it is desirable to centralize an 
arbiter that keeps track of where primary and secondary 
copies of a server are located in a distributed computing 
system. However, a node that hosts such a centralized 
arbiter may itself fail. Hence, it is necessary to provide a 
mechanism to select a new node to host the centralized 
arbiter. 

[0007] Moreover, this selection mechanism must 
operate in a distributed fashion because, for the rea- 
sons stated above, no centralized mechanism is certain 
to continue functioning. Furthermore, it is necessary for 
the node selection process to operate so that the nodes
that remain functioning in the distributed computing
system agree on the same node to host the centralized
arbiter. For efficiency reasons, it is also desirable for the
node selection mechanism not to move the centralized
arbiter unless it is necessary to do so.

[0008] Hence, what is needed is a method and an 
apparatus that operates in a distributed manner to 
select a node to host a primary server for a service. 

SUMMARY

[0009] One embodiment of the present invention
provides a system for selecting a node to host a primary
server for a service from a plurality of nodes in a
distributed computing system. The system operates by
receiving an indication that a state of the distributed
computing system has changed. In response to this
indication, the system determines if there is already a
node hosting the primary server for the service. If not,
the system selects a node to host the primary server
using the assumption that a given node from the plurality
of nodes in the distributed computing system hosts
the primary server. The system then communicates
rank information between the given node and other
nodes in the distributed computing system, wherein
each node in the distributed computing system has a
unique rank with respect to the other nodes in the
distributed computing system. The system next compares
the rank of the given node with the rank of the other
nodes in the distributed computing system. If one of the
other nodes has a higher rank than the given node, the
system disqualifies the given node from hosting the
primary server.

[0010] In one embodiment of the present invention,
if there exists a node to host the primary server, the
system allows the node that hosts the primary server to
communicate with other nodes in the distributed
computing system in order to disqualify the other nodes from
hosting the primary server.

[0011] In one embodiment of the present invention,
the system maintains a candidate variable in the given
node identifying a candidate node to host the primary
server. In a variation on this embodiment, the system
initially sets the candidate variable to identify the given
node.

[0012] In one embodiment of the present invention,
after a new node has been selected to host the primary
server, if the new node is different from a previous node
that hosted the primary server, the system maps
connections for the service to the new node. In a variation
on this embodiment, the system also configures the
new node to host the primary server for the service.

[0013] In one embodiment of the present invention,
the system restarts the service if the service was
interrupted as a result of the change in state of the
distributed computing system.

[0014] In one embodiment of the present invention,
the given node in the distributed computing system can
act as one of: a host for the primary server for the
service; a host for a secondary server for the service,
wherein the secondary server periodically receives
checkpointing information from the primary server; or a
spare for the primary server, wherein the spare does not
receive checkpointing information from the primary
server.

[0015] In one embodiment of the present invention,
upon initial startup of the service, the system selects a
highest ranking spare to host the primary server for the
service.

[0016] In one embodiment of the present invention,
the system allows the primary server to configure
spares in the distributed computing system to host
secondary servers for the service.

[0017] In one embodiment of the present invention,
comparing the rank of the given node with the rank of
the other nodes in the distributed computing system
involves considering a host for a secondary server to
have a higher rank than a spare.

[0018] In one embodiment of the present invention,
after disqualifying the given node from hosting the
primary server, the system ceases to communicate rank
information between the given node and the other
nodes in the distributed computing system.

BRIEF DESCRIPTION OF THE FIGURES 

[0019]

FIG. 1 illustrates a distributed computing system in
accordance with an embodiment of the present
invention.

FIG. 2 illustrates how highly available services are
controlled within a distributed computing system in
accordance with an embodiment of the present
invention.

FIG. 3 illustrates how a replica manager controls
highly available services in accordance with an
embodiment of the present invention.

FIG. 4 is a flow chart illustrating the process of
selecting and configuring a new primary server in
accordance with an embodiment of the present
invention.

FIG. 5 is a flow chart illustrating some of the
operations performed by a primary server in accordance
with an embodiment of the present invention.

FIG. 6 illustrates how a node is selected to host a
primary server through a disqualification process in
accordance with an embodiment of the present
invention.

FIG. 7 illustrates how nodes are disqualified in
accordance with an embodiment of the present
invention.

DETAILED DESCRIPTION 

[0020] The following description is presented to 
enable any person skilled in the art to make and use the
invention, and is provided in the context of a particular 
application and its requirements. Various modifications 
to the disclosed embodiments will be readily apparent to 
those skilled in the art, and the general principles 
defined herein may be applied to other embodiments 
and applications without departing from the spirit and 
scope of the present invention. Thus, the present inven- 
tion is not intended to be limited to the embodiments 
shown, but is to be accorded the widest scope consist- 
ent with the principles and features disclosed herein. 
[0021] The data structures and code described in 
this detailed description are typically stored on a com- 
puter readable storage medium, which may be any 
device or medium that can store code and/or data for 
use by a computer system. This includes, but is not limited
to, magnetic and optical storage devices such as
disk drives, magnetic tape, CDs (compact discs) and 
DVDs (digital versatile discs or digital video discs), and 
computer instruction signals embodied in a transmis- 
sion medium (with or without a carrier wave upon which 
the signals are modulated). For example, the transmis- 
sion medium may include a communications network, 
such as the Internet. 

Distributed Computing System 

[0022] FIG. 1 illustrates a distributed computing 
system 100 in accordance with an embodiment of the 
present invention. Distributed computing system 100 
includes a number of computing nodes 102-105, which 
are coupled together through a network 110. 
[0023] Network 110 can include any type of wire or
wireless communication channel capable of coupling 
together computing nodes. This includes, but is not lim- 
ited to, a local area network, a wide area network, or a 
combination of networks. In one embodiment of the 
present invention, network 110 includes the Internet. In 
another embodiment of the present invention, network 
110 is a local high speed network that enables distrib- 
uted computing system 100 to function as a clustered 
computing system (hereinafter referred to as a "cluster").

[0024] Nodes 102-105 can generally include any 
type of computer system, including, but not limited to, a
computer system based on a microprocessor, a main- 
frame computer, a digital signal processor, a personal 
organizer, a device controller, and a computational 
engine within an appliance. 

[0025] Nodes 102-105 also host servers, which 
include a mechanism for servicing requests from a cli- 
ent for computational and/or data storage resources. 
More specifically, node 102 hosts primary server 106, 
which services requests from clients (not shown) for a 
service involving computational and/or data storage 
resources. 

[0026] Nodes 103-104 host secondary servers 107-108,
respectively, for the same service. These secondary
servers act as backup servers for primary server
106. To this end, secondary servers 107-108 receive
periodic checkpoints 120-121 from primary server 106.
These periodic checkpoints enable secondary servers
107-108 to maintain consistent state with primary
server 106. This makes it possible for one of secondary
servers 107-108 to take over for primary server 106 if
primary server 106 fails.

EP 1 096 751 A2

[0027] Node 105 can serve as a spare node to host
the service provided by primary server 106. Hence,
node 105 can be configured to host a secondary server
with respect to a service provided by primary server
106. Alternatively, if all primary servers and secondary
servers for the service fail, node 105 can be configured
to host a new primary server for the service.
[0028] Also note that nodes 102-105 contain distributed
selection mechanisms 132-135, respectively.
Distributed selection mechanisms 132-135 communicate
with each other to select a new node to host primary
server 106 when node 102 fails or otherwise
becomes unavailable. This process is described in more
detail below with reference to FIGs. 2-6.

Controlling Highly Available Services 

[0029] FIG. 2 illustrates how highly available services
202-205 are controlled within distributed computing
system 100 in accordance with an embodiment of the
present invention. Note that highly available services
202-205 continue to operate even if individual nodes of
distributed computing system 100 fail.
[0030] Highly available services 202-205 operate
under control of replica manager 206. Referring to FIG.
3, for each service, replica manager 206 keeps a record
of which nodes in distributed computing system 100
function as primary servers, and which nodes function
as secondary servers. For example, in FIG. 3 replica
manager 206 keeps track of highly available services
202-205. The primary server for service 202 is node
103, and the secondary servers are nodes 104 and 105.
The primary server for service 203 is node 104, and the
secondary servers are nodes 103 and 105. The primary
server for service 204 is node 102, and the secondary
servers are nodes 104-105. The primary server for
service 205 is node 103, and the secondary servers are
nodes 102, 104 and 105.
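
By way of illustration only, the record described above with reference to FIG. 3 could be held in a simple mapping from each service to its primary and secondary nodes. The following Python sketch is an assumption about one possible representation; the names replica_table and services_with_primary_on are hypothetical and do not appear in the described embodiment.

```python
# Hypothetical record of the assignments listed for FIG. 3.
# Keys are service reference numerals; values name the hosting nodes.
replica_table = {
    202: {"primary": 103, "secondaries": [104, 105]},
    203: {"primary": 104, "secondaries": [103, 105]},
    204: {"primary": 102, "secondaries": [104, 105]},
    205: {"primary": 103, "secondaries": [102, 104, 105]},
}


def services_with_primary_on(node):
    """Return the services whose primary server is hosted on the given node."""
    return sorted(
        service
        for service, record in replica_table.items()
        if record["primary"] == node
    )
```

A lookup of this kind would let a replica manager determine which services need a new primary when a particular node fails.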

[0031] Replica manager 206 additionally performs
a number of related functions, such as configuring a
node to host a primary (which may involve demoting a
current host for the primary to host a secondary).
Replica manager 206 may additionally perform other
functions, such as: adding a service; removing providers for
a service; registering providers for a service; removing
a service; handling provider failures; bringing up new
providers for a service; and handling dependencies
between services (which may involve ensuring that
primaries for dependent services are co-located on the
same node).



[0032] Referring back to FIG. 2, replica manager
206 is itself a highly available service that operates
under control of replica manager manager (RMM) 208.
Note that RMM 208 is not managed by a higher level
service.

[0033] As illustrated in FIG. 2, RMM 208 communicates
with cluster membership monitor (CMM) 210.
CMM 210 monitors cluster membership and alerts
RMM 208 if any changes in the cluster membership
occur.

[0034] CMM 210 uses transport layer 212 to 
exchange messages between nodes 102-105. 

Process of Selecting a New Primary 


[0035] FIG. 4 is a flow chart illustrating the process
of selecting and configuring a primary server in
accordance with an embodiment of the present invention. Note
that this process is run concurrently by each active node
in distributed computing system 100. The system
begins by receiving an indication from CMM 210 that
the membership in the cluster has changed (step 401).
[0036] In response to this indication, the system
obtains a lock on a local candidate variable which
contains an identifier for a candidate node to host primary
server 106 (step 402). The system also obtains an
additional lock to hold off requesters for the service (step
404).

[0037] Next, the system executes a disqualification
process by communicating with other nodes in distributed
computing system 100 in order to disqualify the
other nodes from acting as the primary server 106 (step
406). This process is described in more detail with
reference to FIG. 6 below.
[0038] After the disqualification process, the
remaining node, which is not disqualified, becomes the
primary node. If the node hosting primary server 106
has changed, this may involve re-mapping connections
for the service to point to the new node (step 408). It
may also involve initializing the new node to act as the
host for the primary (step 410).

[0039] Finally, the service is started (step 412). This
may involve unfreezing the service if it was previously
frozen, as well as releasing the previously obtained lock
that holds off requesters for the service.
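
By way of illustration only, the sequence of steps 401 through 412 may be sketched in Python as follows. This is a simplified, single-node sketch under stated assumptions, not the implementation of the described embodiment: the class NodeSelector, its log of actions, and the run_disqualification callback are hypothetical names, and the callback is assumed to set the candidate and release the candidate lock itself, as FIG. 6 describes for steps 608 and 618.

```python
import threading


class NodeSelector:
    """Hypothetical per-node handler for a cluster membership change."""

    def __init__(self, node_id, run_disqualification):
        self.node_id = node_id
        self.candidate = None
        self.candidate_lock = threading.Lock()
        self.service_lock = threading.Lock()
        # Callback standing in for the disqualification process (step 406).
        self.run_disqualification = run_disqualification
        self.previous_primary = None
        self.log = []  # records the actions taken, for illustration

    def on_membership_change(self):
        self.candidate_lock.acquire()  # step 402
        self.service_lock.acquire()    # step 404: hold off requesters
        try:
            new_primary = self.run_disqualification(self)  # step 406
            if new_primary != self.previous_primary:
                self.log.append(("remap", new_primary))    # step 408
                self.log.append(("init", new_primary))     # step 410
                self.previous_primary = new_primary
            self.log.append(("start", new_primary))        # step 412
        finally:
            self.service_lock.release()
            # Release the candidate lock if the callback failed to do so.
            if self.candidate_lock.locked():
                self.candidate_lock.release()
```

Note that the re-mapping and initialization actions are recorded only when the selected node differs from the previous primary, consistent with the preference stated above for not moving the primary unless necessary.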

[0040] FIG. 5 is a flow chart illustrating some of the
operations performed by a primary server 106 in
accordance with an embodiment of the present invention.
During operation, primary server 106 performs
periodic checkpointing operations 120-121 with secondary
servers 107-108, respectively (step 502). These
checkpointing operations allow secondary servers 107-108
to take over from primary server 106 if primary
server 106 fails. Primary server 106 also periodically
attempts to promote spare nodes (such as node 105 in
FIG. 1) to host secondaries (step 504). This promotion
process involves transferring state information to a
spare node in order to bring the spare node up to date
with respect to the existing secondaries.
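
By way of illustration only, the checkpointing and promotion operations of steps 502 and 504 may be sketched in Python as follows. The class Primary and its in-memory state dictionary are hypothetical; an actual embodiment would transfer state between nodes over a network rather than copy it within one process.

```python
import copy


class Primary:
    """Hypothetical in-process model of a primary server."""

    def __init__(self, state, secondaries, spares):
        self.state = state              # service state held by the primary
        self.secondaries = secondaries  # node id -> last checkpointed state
        self.spares = spares            # node ids that hold no state yet

    def checkpoint(self):
        """Step 502: copy the current state to every secondary."""
        for node in self.secondaries:
            self.secondaries[node] = copy.deepcopy(self.state)

    def promote_spares(self):
        """Step 504: bring each spare up to date and treat it as a secondary."""
        while self.spares:
            node = self.spares.pop(0)
            self.secondaries[node] = copy.deepcopy(self.state)
```

A deep copy is used so that a later change to the state of the primary does not silently alter a checkpoint already held by a secondary.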
[0041] FIG. 6 illustrates how a node is selected to
host primary server 106 through a disqualification
process in accordance with an embodiment of the present
invention. Note that FIG. 6 describes in more detail the
process described above with reference to step 406 in
FIG. 4.

[0042] The system starts by determining if a node
that was previously hosting primary server 106
continues to exist (step 602).
[0043] If not, the system retrieves the state of a
local provider for the service (step 604). The system
then sets the candidate variable to identify the local
provider (step 606), and subsequently unlocks the
candidate lock that was set previously in step 402 of FIG. 4
(step 608). Next, if the candidate is not the local
provider, the system ends the process (step 610).
[0044] Next, for all other nodes I in the cluster, the
system attempts to disqualify node I by writing a new
identifier into the candidate variable for node I if the rank
of node I is less than the rank of the present node (step
612). This process is described in more detail with
reference to FIG. 7 below. Finally, if node I's local provider
has a higher rank than the present node, the process
terminates because the present node is disqualified
(step 614).

[0045] Note that a rank of a node can be obtained
by comparing a unique identifier for the node with
unique identifiers for other nodes. Also note that the
rank of a primary server is greater than the rank of a
secondary server, and that the rank of a secondary
server is greater than the rank of a spare. The above-listed
restrictions on rank ensure that an existing primary
that has not failed continues to function as the
primary, and that an existing secondary will be chosen
ahead of a spare. Of course, when the system is
initialized, no primaries or secondaries exist, so a spare is
selected to be the primary.
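
By way of illustration only, the ordering described above may be expressed in Python as a pair consisting of a role and a node identifier. The names Role, rank and highest_ranked are hypothetical. The sketch also assumes that, among nodes of the same role, the larger identifier ranks higher; the description states only that rank can be obtained by comparing unique identifiers, so the direction of that comparison is an assumption.

```python
from enum import IntEnum


class Role(IntEnum):
    SPARE = 0
    SECONDARY = 1
    PRIMARY = 2


def rank(role, node_id):
    """Return a comparable rank: role first, then the unique node identifier."""
    # Assumption: a larger identifier wins among nodes with the same role.
    return (role, node_id)


def highest_ranked(providers):
    """Given a mapping of node id -> Role, return the highest-ranked node id."""
    return max(providers, key=lambda node_id: rank(providers[node_id], node_id))
```

Because every node identifier is unique, no two nodes share a rank, and a comparison of this kind always yields a single winner.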

[0046] On the other hand, if the node that was hosting
primary server 106 continues to function, the system
sets the candidate to be this node (step 616), and
unlocks the candidate node (step 618).
[0047] If the present node does not host primary
server 106, the process is finished. Otherwise, if the
present node is hosting primary server 106, the system
considers each other node I in the cluster. If the present
node has already communicated with I, the system
skips node I (step 622). Otherwise, the system communicates
with node I in order to disqualify node I from acting
as the host for primary server 106 (step 624). This
may involve causing an identifier for the present node to
be written into the candidate variable for node I.
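
By way of illustration only, the selection process of FIG. 6 may be simulated within a single Python process as follows. The names Node and elect are hypothetical. The sketch makes two simplifying assumptions: the nodes run one after another rather than concurrently, so every node first initializes its candidate variable and only then do the disqualification loops run, and a larger node identifier is taken to rank higher among nodes of the same role.

```python
class Node:
    SPARE, SECONDARY, PRIMARY = 0, 1, 2

    def __init__(self, node_id, role):
        self.node_id = node_id
        self.role = role
        self.candidate = None  # candidate variable for this node

    def rank(self):
        return (self.role, self.node_id)


def elect(cluster, surviving_primary=None):
    """Return the node that every node in the cluster ends up naming."""
    if surviving_primary is not None:
        # Steps 616-624: the existing primary is the candidate everywhere.
        for node in cluster:
            node.candidate = surviving_primary
        return surviving_primary

    # Steps 604-608: each node initially names its own local provider.
    for node in cluster:
        node.candidate = node

    for node in cluster:
        if node.candidate is not node:
            continue  # step 610: already overwritten by a higher-ranked node
        for other in cluster:
            if other is node:
                continue
            if other.rank() > node.rank():
                break  # step 614: the present node is disqualified
            if node.rank() > other.candidate.rank():
                other.candidate = node  # step 612: disqualify the other node
    return cluster[0].candidate
```

In this sketch only the highest-ranked node completes its loop without being disqualified, so its identifier is the one left in every candidate variable.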
[0048] FIG. 7 illustrates how nodes are disqualified
in accordance with an embodiment of the present invention.
Note that FIG. 7 describes in more detail the process
described above with reference to step 612 in FIG.
6. The caller first locks the candidate variable for node I
(step 702). If the caller determines that the caller's provider
has a higher rank than is specified in the candidate
variable for I, the caller overwrites the candidate variable
for I with the caller's provider (step 704). Next, the
caller unlocks the candidate variable for I (step 706).
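
By way of illustration only, the lock, compare, overwrite and unlock sequence of steps 702 through 706 may be sketched in Python as follows. The class CandidateVariable and the method try_overwrite are hypothetical names, and ranks are assumed to be comparable tuples of a role value and a node identifier.

```python
import threading


class CandidateVariable:
    """Hypothetical candidate variable guarded by its own lock."""

    def __init__(self, initial_rank, initial_id):
        self._lock = threading.Lock()
        self.rank = initial_rank
        self.node_id = initial_id

    def try_overwrite(self, caller_rank, caller_id):
        """Overwrite the candidate only if the caller outranks it."""
        with self._lock:                 # step 702 (lock), step 706 (unlock)
            if caller_rank > self.rank:  # step 704
                self.rank = caller_rank
                self.node_id = caller_id
                return True
            return False
```

Because the comparison and the overwrite occur while the lock is held, concurrent callers cannot replace a higher-ranked candidate with a lower-ranked one, whatever the order in which they arrive.
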
[0049] The foregoing descriptions of embodiments 
of the invention have been presented for purposes of 
illustration and description only. They are not intended 
to be exhaustive or to limit the present invention to the 
forms disclosed. Accordingly, many modifications and 
variations will be apparent to practitioners skilled in the 
art. Additionally, the above disclosure is not intended to 
limit the present invention. The scope of the present 
invention is defined by the appended claims. 

Claims 

1. A method for selecting a node 102, 103, 104, 105 to
host a primary server 106 for a service from a plurality
of nodes in a distributed computing system
100, the method comprising:

a) receiving 401 an indication that a state of the 
distributed computing system has changed; 

b) in response to the indication, determining 
602 if there is already a node hosting the pri- 
mary server for the service; and 

c) if there is not already a node hosting the primary
server, selecting 604 a node to host the
primary server based upon rank information for
the nodes.

2. The method of claim 1, wherein selecting the node
to host the primary server involves: 

a) assuming that a given node from the plurality 
of nodes in the distributed computing system 
hosts the primary server, 

b) communicating 612 rank information 
between the given node and other nodes in the 
distributed computing system, wherein each 
node in the distributed computing system has a 
unique rank with respect to the other nodes in 
the distributed computing system, 

c) comparing a rank of the given node with a 
rank of the other nodes in the distributed com- 
puting system, and 

d) if one of the other nodes in the distributed 
computing system has a higher rank than the 
given node, disqualifying 614 the given node 
from hosting the primary server. 

3. The method of claim 2, further comprising, if there 
exists a node that is configured to host the primary 
server, allowing 622 the node that is configured to 
host the primary server to communicate with other 
nodes in the distributed computing system in order 
to disqualify 624 the other nodes from hosting the 
primary server. 






4. The method of claim 2, wherein assuming that the 
given node hosts the primary server involves: 

a) maintaining a candidate variable in the given
node identifying a candidate node to host the
primary server; and

b) initially setting the candidate variable to 
identify the given node. 



5. The method of claim 1, further comprising, after a 
new node has been selected to host the primary 
server, if the new node is different from a previous 
node that hosted the primary server, establishing 
408 connections for the service to the new node. 

6. The method of claim 1, further comprising, after a 
new node has been selected to host the primary 
server, if the new node is different from a previous
node that hosted the primary server, configuring 
the new node to host the primary server for the 
service. 

7. The method of claim 1, further comprising restart- 
ing the service if the service was interrupted as a 
result of the change in state of the distributed com- 
puting system. 

8. The method of claim 2, wherein the given node in 
the distributed computing system acts as one of:

a) a host for the primary server 106 for the 
service; 

b) a host for a secondary server 107, 108 for 
the service, wherein the secondary server peri- 
odically receives checkpointing information 
from the primary server; and 

c) a spare for the primary server, wherein the 
spare does not receive checkpointing informa- 
tion from the primary server. 

9. The method of claim 8, further comprising, upon ini- 
tial startup of the service, selecting a highest rank- 
ing spare to host the primary server for the service. 



10. The method of claim 8, further comprising allowing
the primary server to configure 504 spares in the
distributed computing system to host secondary
servers for the service.

11. The method of claim 8, wherein comparing the rank
of the given node with the rank of the other nodes in
the distributed computing system involves considering
a host for the primary server to have a higher
rank than a host for a spare, and considering a host
for a secondary server to have a higher rank than a
spare.

12. The method of claim 2, wherein disqualifying the
given node from hosting the primary server involves
ceasing to communicate rank information between
the given node and the other nodes in the distributed
computing system.

13. A computer-readable storage medium storing
instructions that when executed by a computer
cause the computer to perform the method steps of
any one of claims 1 to 12.

14. A computer program, which when run on a computer,
is adapted to perform the method steps of any
one of claims 1 to 12.

15. An apparatus that selects a node to host a primary
server for a service from a plurality of nodes in a
distributed computing system, the apparatus comprising:

a) a receiving mechanism 401 that is configured
to receive an indication that a state of the
distributed computing system has changed;

b) a determination mechanism 602 that is configured
to determine if there is already a node
hosting the primary server for the service in
response to the indication;

c) a selecting mechanism 604, wherein if there
is not already a node hosting the primary
server, the selecting mechanism is configured
to select a node to host the primary server
based upon rank information for the nodes.

16. The apparatus of claim 15, wherein, in selecting a
node to host the primary server based upon rank
information, the selecting mechanism is configured
to:

a) communicate rank information between the
given node and other nodes in the distributed
computing system, wherein each node in the
distributed computing system has a unique
rank with respect to the other nodes in the
distributed computing system, and to

b) compare a rank of the given node with a rank
of the other nodes in the distributed computing
system.

17. The apparatus of claim 16, further comprising a
disqualification mechanism that is configured to
disqualify the given node from hosting the primary
server if one of the other nodes in the distributed
computing system has a higher rank than the given
node.

18. The apparatus of claim 16, further comprising a
mechanism on the primary server that is configured
to communicate with other nodes in the distributed
computing system in order to disqualify the other
nodes from hosting the primary server.

19. The apparatus of claim 16, wherein the selecting
mechanism is configured to:

a) maintain a candidate variable in the given
node identifying a candidate node to host the
primary server; and to

b) initially set the candidate variable to identify
the given node.

20. The apparatus of claim 15, further comprising a
connection mechanism that is configured to establish
connections for the service to a new node after
the new node has been selected to host the primary
server, and if the new node is different from a previous
node that hosted the primary server.

21. The apparatus of claim 15, further comprising a
mechanism that configures a new node to host the
primary server for the service, after the new node
has been selected to host the primary server, and if
the new node is different from a previous node that
hosted the primary server.

22. The apparatus of claim 15, further comprising a
restarting mechanism that is configured to restart
the service if the service was interrupted as a result
of the change in state of the distributed computing
system.

23. The apparatus of claim 16, wherein the given node
in the distributed computing system acts as one of:

a) a host for the primary server for the service;

b) a host for a secondary server for the service,
wherein the secondary server periodically
receives checkpointing information from the
primary server; and

c) a spare for the primary server, wherein the
spare does not receive checkpointing information
from the primary server.

24. The apparatus of claim 23, further comprising an
initialization mechanism wherein during initialization
of the service, the initialization mechanism is
configured to select a highest ranking spare to host
the primary server for the service.

25. The apparatus of claim 23, further comprising a
promotion mechanism on the primary server that
is configured to promote spares in the distributed
computing system to host secondary servers
for the service.

26. The apparatus of claim 23, wherein while comparing
the rank of the given node with the rank of the
other nodes in the distributed computing system,
the selecting mechanism is configured to consider
a host for the primary server to have a higher rank
than a host for a secondary server, and to consider
a host for a secondary server to have a higher rank
than a spare.

27. The apparatus of claim 16, wherein the selecting
mechanism is configured to cease to communicate
rank information between the given node and the
other nodes in the distributed computing system
after the given node is disqualified by the
disqualification mechanism.

28. A method for selecting a node to host a primary 
server for a service from a plurality of nodes in a 
distributed computer system, comprising: 

a) communicating disqualification information 
between the node and remaining nodes in the 
plurality of nodes; 

b) disqualifying the node from hosting the pri- 
mary server based upon the disqualification 
information received from the remaining nodes. 

29. The method of claim 28, wherein the disqualification
information comprises node rank information.

30. The method of claim 29, wherein the node rank for 
a given node is calculated using an assumption that 
the given node hosts the primary server. 

31. The method of claim 30, wherein the calculated 
node rank is unique with respect to the ranks of 
other nodes in the distributed computer system. 

32. The method of claim 29, wherein the disqualifying 
of the node comprises: 

a) comparing a rank of the node to a set of 
ranks of the remaining nodes in the distributed 
computer system; and 

b) disqualifying the node from hosting the pri- 
mary server if one of the set of ranks of the 
remaining nodes is higher than the rank of the 
node. 

33. The method of claim 28, further comprising repeat- 
ing the acts of communicating disqualification infor- 
mation and disqualifying the node for at least one 
more node in the plurality of nodes. 